Towards the Classification of the Finnish Internet Parsebank: Detecting Translations and Informality
Abstract
This paper presents the first results on detecting informality and machine and human translations in the Finnish Internet Parsebank, a project developing a large-scale, web-based corpus with full morphological and syntactic analyses. The paper aims to classify the Parsebank according to these criteria and to study the linguistic characteristics of the resulting classes. The features used include both lexical and morpho-syntactic properties, such as syntactic n-grams. The results are practically applicable, with an AUC range of 85–85% for the human-translated texts, ~98% for the machine-translated texts and 73% for the informal texts. While word-based classification performs well in the in-domain experiments, delexicalized methods with morpho-syntactic features prove more tolerant of variation caused by genre or source language. In addition, the results show that the features used in the classification provide interesting pointers for further, more detailed studies on the linguistic characteristics of these texts.